MASSACHUSETTS INSTITUTE OF TECHNOLOGY 
ARTIFICIAL INTELLIGENCE LABORATORY 

and 

CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING 

DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES 

A.I. Memo No. 1441 August 6, 1993 

C.B.C.L. Memo No. 84 

On the Convergence of Stochastic Iterative 
Dynamic Programming Algorithms 

Tommi Jaakkola, Michael I. Jordan and Satinder P. Singh 

Abstract 

Recent developments in the area of reinforcement learning have yielded a number of new algorithms for 
the prediction and control of Markovian environments. These algorithms, including the TD($\lambda$) algorithm 
of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as ap- 
proximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence 
of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approx- 
imation theory via a new convergence theorem. The theorem establishes a general class of convergent 
algorithms to which both TD($\lambda$) and Q-learning belong. 

An important component of many real world learning problems is the temporal credit assignment 
problem: the problem of assigning credit or blame to individual components of a 
temporally-extended plan of action, based on the success or failure of the plan as a whole. To 
solve such a problem, the learner must be equipped with the ability to assess the long-term 
consequences of particular choices of action and must be willing to forego an immediate payoff 
for the prospect of a longer term gain. Moreover, because most real world problems involving 
prediction of the future consequences of actions involve substantial uncertainty, the learner 
must be prepared to make use of a probability calculus for assessing and comparing actions. 

There has been increasing interest in the temporal credit assignment problem, due princi- 
pally to the development of learning algorithms based on the theory of dynamic programming 
(DP) (Barto, Sutton, & Watkins, 1990; Werbos, 1992). Sutton's (1988) TD($\lambda$) algorithm 
addressed the problem of learning to predict in a Markov environment, utilizing a temporal 
difference operator to update the predictions. Watkins' (1989) Q-learning algorithm extended 
Sutton's work to control problems, and also clarified the ties to dynamic programming. 

In the current paper, our concern is with the stochastic convergence of DP-based learning 
algorithms. Although Watkins (1989) and Watkins and Dayan (1992) proved that Q-learning 
converges with probability one, and Dayan (1992) observed that TD(0) is a special case of Q- 
learning and therefore also converges with probability one, these proofs rely on a construction 
that is particular to Q-learning and fail to reveal the ties of Q-learning to the broad theory of 
stochastic approximation (e.g., Wasan, 1969). Our goal here is to provide a simpler proof of 
convergence for Q-learning by making direct use of stochastic approximation theory. We also 
show that our proof extends to TD($\lambda$) for arbitrary $\lambda$. Several other authors have recently 
presented results that are similar to those presented here: Dayan and Sejnowski (1993) for 
TD($\lambda$), Peng and Williams (1993) for TD($\lambda$), and Tsitsiklis (1993) for Q-learning. Our results 
appear to be closest to those of Tsitsiklis (1993). 

We begin with a general overview of Markovian decision problems and DP. We introduce 
the Q-learning algorithm as a stochastic form of DP. We then present a proof of convergence 
for a general class of stochastic processes of which Q-learning is a special case. We then discuss 
TD($\lambda$) and show that it is also a special case of our theorem. 

Markovian decision problems 

A useful mathematical model of temporal credit assignment problems, studied in stochastic 
control theory (Aoki, 1967) and operations research (Ross, 1970), is the Markovian decision 
problem. Markovian decision problems are built on the formalism of controlled Markov chains. 
Let $S = \{1, 2, \ldots, N\}$ be a discrete state space and let $U(i)$ be the discrete set of actions available 
to the learner when the chain is in state $i$. The probability of making a transition from state $i$ 
to state $j$ is given by $p_{ij}(u)$, where $u \in U(i)$. The learner defines a policy $\mu$, which is a function 
from states to actions. Associated with every policy $\mu$ is a Markov chain defined by the state 
transition probabilities $p_{ij}(\mu(i))$. 

There is an instantaneous cost $c_i(u)$ associated with each state $i$ and action $u$, where $c_i(u)$ 
is a random variable with expected value $\bar{c}_i(u)$. We also define a value function $V_\mu(i)$, which is 
the expected sum of discounted future costs given that the system begins in state $i$ and follows 
policy $\mu$: 

$$V_\mu(i) = \lim_{N \to \infty} E\Big\{ \sum_{t=0}^{N-1} \gamma^t c_{s_t}(\mu(s_t)) \,\Big|\, s_0 = i \Big\}, \tag{1}$$

where $s_t \in S$ is the state of the Markov chain at time $t$. Future costs are discounted by a factor 
$\gamma^t$, where $\gamma \in (0, 1)$. We wish to find a policy that minimizes the value function: 

$$V^*(i) = \min_\mu V_\mu(i). \tag{2}$$

Such a policy is referred to as an optimal policy and the corresponding value function is referred 
to as the optimal value function. Note that the optimal value function is unique, but an optimal 
policy need not be unique. 

Markovian decision problems can be solved by dynamic programming (Bertsekas, 1987). 
The basis of the DP approach is an equation that characterizes the optimal value function. 
This equation, known as Bellman's equation, characterizes the optimal value of the state in 
terms of the optimal values of possible successor states: 

$$V^*(i) = \min_{u \in U(i)} \Big\{ \bar{c}_i(u) + \gamma \sum_{j \in S} p_{ij}(u) V^*(j) \Big\}. \tag{3}$$

To motivate Bellman's equation, suppose that the system is in state i at time t and consider 
how V*(i) should be characterized in terms of possible transitions out of state i. Suppose that 
action $u$ is selected and the system transitions to state $j$. The expression $c_i(u) + \gamma V^*(j)$ is 
the cost of making a transition out of state i plus the discounted cost of following an optimal 
policy thereafter. The minimum of the expected value of this expression, over possible choices 
of actions, seems a plausible measure of the optimal cost at i and by Bellman's equation is 
indeed equal to V*(i). 

There are a variety of computational techniques available for solving Bellman's equation. 
The technique that we focus on in the current paper is an iterative algorithm known as value 
iteration. Value iteration solves for $V^*(i)$ by setting up a recurrence relation for which Bellman's 
equation is a fixed point. Denoting the estimate of $V^*(i)$ at the $k$th iteration as $V^{(k)}(i)$, we 
have: 

$$V^{(k+1)}(i) = \min_{u \in U(i)} \Big\{ \bar{c}_i(u) + \gamma \sum_{j \in S} p_{ij}(u) V^{(k)}(j) \Big\}. \tag{4}$$

This iteration can be shown to converge to $V^*(i)$ for arbitrary initial $V^{(0)}(i)$ (Bertsekas, 1987). 
The proof is based on showing that the iteration from $V^{(k)}(i)$ to $V^{(k+1)}(i)$ is a contraction 
mapping. That is, it can be shown that: 

$$\max_i |V^{(k+1)}(i) - V^*(i)| \le \gamma \max_i |V^{(k)}(i) - V^*(i)|, \tag{5}$$

which implies that $V^{(k)}(i)$ converges to $V^*(i)$ and also places an upper bound on the convergence 
rate. 
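
As an illustration of Equations 4 and 5, the following minimal sketch runs value iteration on a small problem and records the iterates so that the contraction can be checked numerically. The two-state, two-action transition table `P`, the expected costs `C`, and the discount value are hypothetical numbers chosen only for this example; they do not come from the paper.

```python
# Hypothetical two-state, two-action problem (illustrative numbers only).
GAMMA = 0.9
# P[i][u][j]: probability of moving from state i to state j under action u.
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
# C[i][u]: expected instantaneous cost of taking action u in state i.
C = [
    [1.0, 2.0],
    [0.5, 3.0],
]


def bellman_backup(V):
    """One synchronous sweep of value iteration (Equation 4)."""
    n = len(V)
    return [
        min(
            C[i][u] + GAMMA * sum(P[i][u][j] * V[j] for j in range(n))
            for u in range(len(C[i]))
        )
        for i in range(n)
    ]


def value_iteration(V0, sweeps):
    """Return the iterates V^(0), V^(1), ..., V^(sweeps)."""
    iterates = [list(V0)]
    for _ in range(sweeps):
        iterates.append(bellman_backup(iterates[-1]))
    return iterates


def max_gap(V, W):
    """Maximum-norm distance between two value functions."""
    return max(abs(a - b) for a, b in zip(V, W))


iterates = value_iteration([0.0, 0.0], 400)
V_star = iterates[-1]  # numerically converged stand-in for V*
```

Comparing `max_gap(iterates[k + 1], V_star)` with `GAMMA * max_gap(iterates[k], V_star)` for successive `k` shows the error shrinking by at least the discount factor per sweep, which is the content of Equation 5.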

Watkins (1989) utilized an alternative notation for expressing Bellman's equation that is 
particularly convenient for deriving learning algorithms. Define the function Q*(i,u) to be the 
expression appearing inside the "min" operator of Bellman's equation: 

$$Q^*(i,u) = \bar{c}_i(u) + \gamma \sum_{j \in S} p_{ij}(u) V^*(j). \tag{6}$$

Using this notation Bellman's equation can be written as follows: 

$$V^*(i) = \min_{u \in U(i)} Q^*(i,u). \tag{7}$$

Moreover, value iteration can be expressed in terms of Q functions: 

$$Q^{(k+1)}(i,u) = \bar{c}_i(u) + \gamma \sum_{j \in S} p_{ij}(u) V^{(k)}(j), \tag{8}$$

where $V^{(k)}(i)$ is defined in terms of $Q^{(k)}(i,u)$ as follows: 

$$V^{(k)}(i) = \min_{u \in U(i)} Q^{(k)}(i,u). \tag{9}$$

The mathematical convenience obtained from using Q's rather than V's derives from the fact 
that the minimization operator appears inside the expectation in Equation 8, whereas it appears 
outside the expectation in Equation 4. This fact plays an important role in the convergence 
proof presented in this paper. 
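
The Q-function form of value iteration can be sketched in the same way. The example below is only an illustration under assumed data: the tables `P` and `C` and the discount value are hypothetical, and `q_backup` is a name introduced here for one sweep of Equations 8 and 9.

```python
# Hypothetical two-state, two-action problem (illustrative numbers only).
GAMMA = 0.9
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
C = [
    [1.0, 2.0],
    [0.5, 3.0],
]


def q_backup(Q):
    """One sweep of value iteration on Q functions."""
    n = len(Q)
    # Equation 9: the minimization is applied to the successor-state values,
    # i.e. inside the expectation over successor states.
    V = [min(Q[j]) for j in range(n)]
    # Equation 8: expected cost plus discounted expected successor value.
    return [
        [
            C[i][u] + GAMMA * sum(P[i][u][j] * V[j] for j in range(n))
            for u in range(len(C[i]))
        ]
        for i in range(n)
    ]


Q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(400):
    Q = q_backup(Q)

# Equation 7: the value function is recovered by minimizing over actions.
V_from_Q = [min(row) for row in Q]
```

After enough sweeps `Q` is numerically a fixed point of `q_backup`, and `V_from_Q` satisfies Bellman's equation for the assumed problem.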

The value iteration algorithm in Equation 4 or Equation 8 can also be executed asyn- 
chronously (Bertsekas & Tsitsiklis, 1989). In an asynchronous implementation, the update of 
the value of a particular state proceeds in parallel with the updates of the values of other states. 
Bertsekas & Tsitsiklis (1989) show that as long as each state is updated infinitely often and 
each action is tried an infinite number of times in each state, then the asynchronous algorithm 
eventually converges to the optimal value function. Moreover, asynchronous execution has the 
advantage that it is directly applicable to real-time Markovian decision problems (RTDP; Barto, 
Bradtke, & Singh, 1993). In a real-time setting, the system uses its evolving value function 
to choose control actions for an actual process and updates the values of the states along the 
trajectory followed by the process. 

Dynamic programming serves as a starting point for deriving a variety of learning algorithms 
for systems that interact with Markovian environments (Barto, Bradtke, & Singh, 1993; Sutton, 
1988; Watkins, 1989). Indeed, real-time dynamic programming is arguably a form of learning 
algorithm as it stands. Although RTDP requires that the system possess a complete model of 
the environment (i.e., the probabilities $p_{ij}(u)$ and the expected costs $\bar{c}_i(u)$ are assumed known), 
the performance of a system using RTDP improves over time, and its improvement is focused 
on the states that are actually visited. The system "learns" by transforming knowledge in one 
format (the model) into another format (the value function). 

A more difficult learning problem arises when the probabilistic structure of the environment 
is unknown. There are two approaches to dealing with this situation (cf. Barto, Bradtke, 
& Singh, 1993). An indirect approach acquires a model of the environment incrementally, by 
estimating the costs and the transition probabilities, and then uses this model in an ongoing DP 
computation. A direct method dispenses with constructing a model and attempts to estimate 
the optimal value function (or the optimal Q-values) directly. In the remainder of this paper, 
we focus on direct methods, in particular the Q-learning algorithm of Watkins (1989) and the 
TD($\lambda$) algorithm of Sutton (1988). 

The Q-learning algorithm is a stochastic form of value iteration. Consider Equation 8, which 
expresses the update of the Q values in terms of the Q values of successor states. To perform 
a step of value iteration requires knowing the expected costs and the transition probabilities. 
Although such a step cannot be performed without a model, it is nonetheless possible to estimate 
the appropriate update. For an arbitrary V function, the quantity $\sum_{j \in S} p_{ij}(u) V(j)$ can be 
estimated by the quantity $V(j)$, if successor state $j$ is chosen with probability $p_{ij}(u)$. But 
this is assured by simply following the transitions of the actual Markovian environment, which 
makes a transition from state $i$ to state $j$ with probability $p_{ij}(u)$. Thus the sample value of V at 
the successor state is an unbiased estimate of the sum. Moreover $c_i(u)$ is an unbiased estimate 
of $\bar{c}_i(u)$. This reasoning leads to the following relaxation algorithm, where we use $Q_t(i,u)$ and 
$V_t(i)$ to denote the learner's estimates of the Q function and V function at time $t$, respectively: 

$$Q_{t+1}(s_t,u_t) = (1 - \alpha_t(s_t,u_t)) Q_t(s_t,u_t) + \alpha_t(s_t,u_t)[c_{s_t}(u_t) + \gamma V_t(s_{t+1})], \tag{10}$$

where 

$$V_t(s_{t+1}) = \min_{u \in U(s_{t+1})} Q_t(s_{t+1},u). \tag{11}$$

The variables $\alpha_t(s_t,u_t)$ are zero except for the state that is being updated at time $t$. 
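
The relaxation rule in Equations 10 and 11 can be sketched as a small tabular simulation. Everything specific in this example is an assumption made for illustration: the tables `P` and `C`, the discount value, the uniformly random choice of actions, the use of deterministic costs equal to their expected values, and the step-size schedule (visit count raised to the power -0.7, which is one schedule whose sum diverges while the sum of squares converges).

```python
import random

# Hypothetical two-state, two-action problem (illustrative numbers only).
GAMMA = 0.5
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
C = [
    [1.0, 2.0],
    [0.5, 3.0],
]


def q_learning(steps, seed=0):
    """Tabular Q-learning for cost minimization (Equations 10 and 11)."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]
    visits = [[0, 0], [0, 0]]
    s = 0
    for _ in range(steps):
        u = rng.randrange(2)  # assumed exploration: uniform random actions
        s_next = rng.choices([0, 1], weights=P[s][u])[0]  # sample a transition
        visits[s][u] += 1
        alpha = visits[s][u] ** -0.7  # assumed step-size schedule
        # Sampled target: observed cost plus discounted min over next actions.
        target = C[s][u] + GAMMA * min(Q[s_next])
        Q[s][u] = (1 - alpha) * Q[s][u] + alpha * target
        s = s_next
    return Q


def q_star(sweeps=200):
    """Reference solution from model-based value iteration on Q functions."""
    Q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(sweeps):
        V = [min(row) for row in Q]
        Q = [
            [C[i][u] + GAMMA * sum(P[i][u][j] * V[j] for j in range(2))
             for u in range(2)]
            for i in range(2)
        ]
    return Q


Q_learned = q_learning(200000)
Q_reference = q_star()
```

Only the pair visited at the current step is changed, which corresponds to the learning rates being zero for every other pair. The learned table can be compared entry by entry with `Q_reference`, which uses the model that Q-learning itself never sees.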

The fact that Q-learning is a stochastic form of value iteration immediately suggests the use 
of stochastic approximation theory, in particular the classical framework of Robbins and Monro 
(1951). Robbins-Monro theory treats the stochastic convergence of a sequence of unbiased 
estimates of a regression function, providing conditions under which the sequence converges to 
a root of the function. Although the stochastic convergence of Q-learning is not an immediate 
consequence of Robbins-Monro theory, the theory does provide results that can be adapted 
to studying the convergence of DP-based learning algorithms. In this paper we utilize a result 
from Dvoretzky's (1956) formulation of Robbins-Monro theory to prove the convergence of both 
Q-learning and TD($\lambda$). 

Convergence proof for Q-learning 

Our proof is based on the observation that the Q-learning algorithm can be viewed as a stochas- 
tic process to which techniques of stochastic approximation are generally applicable. Due to the 
lack of a formulation of stochastic approximation for the maximum norm, however, we need to 
slightly extend the standard results. This is accomplished by the following theorem the proof 
of which is given in Appendix A. 

Theorem 1 A random iterative process $\Delta_{n+1}(x) = (1 - \alpha_n(x))\Delta_n(x) + \beta_n(x) F_n(x)$ converges 
to zero w.p.1 under the following assumptions: 

1) The state space is finite. 

2) $\sum_n \alpha_n(x) = \infty$, $\sum_n \alpha_n^2(x) < \infty$, $\sum_n \beta_n(x) = \infty$, $\sum_n \beta_n^2(x) < \infty$, and 
$E\{\beta_n(x) \mid P_n\} \le E\{\alpha_n(x) \mid P_n\}$ uniformly w.p.1. 

3) $\| E\{F_n(x) \mid P_n\} \|_W \le \gamma \| \Delta_n \|_W$, where $\gamma \in (0, 1)$. 

4) $\mathrm{Var}\{F_n(x) \mid P_n\} \le C(1 + \| \Delta_n \|_W)^2$, where $C$ is some constant. 

Here $P_n = \{\Delta_n, \Delta_{n-1}, \ldots, F_{n-1}, \ldots, \alpha_{n-1}, \ldots, \beta_{n-1}, \ldots\}$ stands for the past at step $n$. $F_n(x)$, 
$\alpha_n(x)$ and $\beta_n(x)$ are allowed to depend on the past insofar as the above conditions remain valid. 
The notation $\| \cdot \|_W$ refers to some weighted maximum norm. 
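
To make the form of the process in Theorem 1 concrete, here is a minimal scalar simulation. It is a sketch under assumptions that are not in the paper: a single state, identical step sizes for the two coefficient sequences (n raised to the power -0.7), and an update term equal to a contraction of the current value plus bounded zero-mean noise, so that the conditional-mean and variance conditions hold by construction.

```python
import random


def simulate_process(delta0, steps, gamma=0.5, seed=0):
    """Iterate Delta_{n+1} = (1 - alpha_n) Delta_n + beta_n F_n for one state."""
    rng = random.Random(seed)
    delta = delta0
    for n in range(1, steps + 1):
        alpha = n ** -0.7          # sum diverges, sum of squares converges
        beta = alpha               # assumed: same coefficients for both terms
        noise = rng.uniform(-1.0, 1.0)   # zero mean, bounded variance
        # Assumed F_n: conditional mean gamma * delta, so its magnitude is
        # at most gamma times that of delta (the contraction condition).
        F = gamma * delta + noise
        delta = (1 - alpha) * delta + beta * F
    return delta


final_delta = simulate_process(5.0, 100000)
```

In this construction the deterministic part shrinks the iterate while the decaying step sizes average out the noise, so `final_delta` ends up close to zero. A single run illustrates the statement but of course does not prove it.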

In applying the theorem, the A n process will generally represent the difference between a 
stochastic process of interest and some optimal value (e.g., the optimal value function). The 
formulation of the theorem therefore requires knowledge to be available about the optimal 
solution to the learning problem before it can be applied to any algorithm whose convergence is 
to be verified. In the case of Q-learning the required knowledge is available through the theory 
of DP and Bellman's equation in particular. 

The convergence of the Q-learning algorithm now follows easily by relating the algorithm to 
the converging stochastic process defined by Theorem 1 (see footnote 1). In the form of the theorem we have: 

Footnote 1: We note that the theorem is more powerful than is needed to prove the convergence of Q-learning. Its 
generality, however, allows it to be applied to other algorithms as well (see the following section on TD($\lambda$)). 



Theorem 2 The Q-learning algorithm given by 

$$Q_{t+1}(s_t,u_t) = (1 - \alpha_t(s_t,u_t)) Q_t(s_t,u_t) + \alpha_t(s_t,u_t)[c_{s_t}(u_t) + \gamma V_t(s_{t+1})]$$

converges to the optimal $Q^*(s,u)$ values if 

1) The state and action spaces are finite. 

2) $\sum_t \alpha_t(s,u) = \infty$ and $\sum_t \alpha_t^2(s,u) < \infty$ uniformly w.p.1. 

3) $\mathrm{Var}\{c_s(u)\}$ is bounded. 

4) If $\gamma = 1$ all policies lead to a cost free terminal state w.p.1. 

Proof. By subtracting $Q^*(s,u)$ from both sides of the learning rule and by defining 
$\Delta_t(s,u) = Q_t(s,u) - Q^*(s,u)$ together with 

$$F_t(s,u) = c_s(u) + \gamma V_t(s_{next}) - Q^*(s,u) \tag{12}$$

the Q-learning algorithm can be seen to have the form of the process in Theorem 1 with $\beta_t(s,u) = 
\alpha_t(s,u)$. 

To verify that $F_t(s,u)$ has the required properties we begin by showing that it is a contraction 
mapping with respect to some maximum norm. This is done by relating $F_t$ to the DP value 
iteration operator for the same Markov chain. More specifically, 

$$\begin{aligned}
\max_u |E\{F_t(i,u)\}| &= \gamma \max_u \Big| \sum_j p_{ij}(u)[V_t(j) - V^*(j)] \Big| \\
&\le \gamma \max_u \sum_j p_{ij}(u) \max_v |Q_t(j,v) - Q^*(j,v)| \\
&= \gamma \max_u \sum_j p_{ij}(u) V^\Delta(j) = T(V^\Delta)(i),
\end{aligned}$$

where $V^\Delta(j) = \max_v |Q_t(j,v) - Q^*(j,v)|$ and $T$ is the DP value iteration operator for the case where the costs associated with each 
state are zero. If $\gamma < 1$ the contraction property of $T$ and thus of $F_t$ can be seen directly 
from the above formulas. When the future costs are not discounted ($\gamma = 1$) but the chain is 
absorbing and all policies lead to the terminal state w.p.1 there still exists a weighted maximum 
norm with respect to which $T$ is a contraction mapping (see e.g. Bertsekas & Tsitsiklis, 1989). 

The variance of $F_t(s,u)$ given the past is within the bounds of Theorem 1 as it depends on 
$Q_t(s,u)$ at most linearly and the variance of $c_s(u)$ is bounded. 

Note that the proof covers both the on-line and batch versions. □ 



The TD($\lambda$) algorithm 

The TD($\lambda$) algorithm (Sutton, 1988) is also a DP-based learning algorithm that is naturally defined in 
a Markovian environment. Unlike Q-learning, however, TD does not involve decision-making 
tasks but rather predictions about the future costs of an evolving system. TD($\lambda$) converges to 
the same predictions as a version of Q-learning in which there is only one action available at 
each state, but the algorithms are derived from slightly different grounds and their behavioral 
differences are not well understood. In this section we introduce the algorithm and its derivation. 
The proof of convergence is given in the following section. 

Let us define $V_t(i)$ to be the current estimate of the expected cost incurred during the 
evolution of the system starting from state $i$ and let $c_i$ denote the instantaneous random cost 
at state $i$. As in the case of Q-learning we assume that the future costs are discounted at each 
state by a factor $\gamma$. If no discounting takes place ($\gamma = 1$) we need to assume that the Markov 
chain is absorbing, that is, there exists a cost free terminal state to which the system converges 
with probability one. 

We are concerned with estimating the future costs that the learner has to incur. One way 
to achieve these predictions is to simply observe n consecutive random costs weighted by the 
discount factor and to add the best estimate of the costs thereafter. This gives us the estimate 

$$V_t^{(n)}(i_t) = c_{i_t} + \gamma c_{i_{t+1}} + \gamma^2 c_{i_{t+2}} + \ldots + \gamma^{n-1} c_{i_{t+n-1}} + \gamma^n V_t(i_{t+n}) \tag{13}$$

The expected value of this can be shown to be a strictly better estimate than the current 
estimate is (Watkins, 1989). In the undiscounted case this holds only when $n$ is larger than 
some chain-dependent constant. To demonstrate this let us replace $V_t$ with $V^*$ in the above 
formula giving $E\{V^{*(n)}(i_t)\} = V^*(i_t)$, which implies 

$$\max_i |E\{V_t^{(n)}(i)\} - V^*(i)| \le \gamma^n \max_i \Pr\{m_i \ge n\} \max_i |V_t(i) - V^*(i)|, \tag{14}$$

where $m_i$ is the number of steps in a sequence that begins in state $i$ (infinite in the 
non-absorbing case). This implies that if either $\gamma < 1$ or $n$ is large enough so that the chain 
can terminate before $n$ steps starting from an arbitrary initial state then the estimate $V_t^{(n)}$ is 
strictly better than $V_t$. In general, the larger $n$ the more unbiased the estimate is as the effect 
of incorrect $V_t$ vanishes. However, larger $n$ increases the variance of the estimate as there are 
more (independent) terms in the sum. 
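
The truncated estimate of Equation 13 can be written as a short function. This is only a sketch: the function name, the representation of a sequence as parallel lists of visited states and observed costs, and the convention that the value after termination is zero are assumptions made for the example.

```python
def n_step_estimate(costs, states, V, t, n, gamma):
    """Truncated n-step estimate of the cost-to-go from step t (Equation 13).

    costs[k] is the cost observed at step k, states[k] the state visited at
    step k, and V the current table of predictions. The sequence is assumed
    to end in a cost free terminal state after its last listed step.
    """
    T = len(states)
    total = 0.0
    discount = 1.0
    # Sum up to n observed costs, each weighted by the discount factor.
    for k in range(t, min(t + n, T)):
        total += discount * costs[k]
        discount *= gamma
    # Add the current prediction at the state reached after n steps,
    # unless the sequence has already terminated by then.
    if t + n < T:
        total += discount * V[states[t + n]]
    return total


costs = [1.0, 2.0, 3.0]
states = [0, 1, 2]
V = [10.0, 20.0, 30.0]
one_step = n_step_estimate(costs, states, V, 0, 1, 0.5)    # 1 + 0.5 * 20
two_step = n_step_estimate(costs, states, V, 0, 2, 0.5)    # 1 + 1 + 0.25 * 30
full = n_step_estimate(costs, states, V, 0, 3, 0.5)        # 1 + 1 + 0.75
```

With larger `n` the current predictions enter with a smaller weight, while more random costs enter the sum, which is the bias and variance trade-off described above.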

Despite the error reduction property of the truncated estimate it is difficult to calculate in 
practice as one would have to wait n steps before the predictions could be updated. In addition 
it clearly has a huge variance. A remedy to these problems is obtained by constructing a new 
estimate by averaging over the truncated predictions. TD($\lambda$) is based on taking the geometric 
average: 

$$V_t^\lambda(i) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} V_t^{(n)}(i) \tag{15}$$

As a weighted average it is still a strictly better estimate than $V_t(i)$ with the additional benefit 
of being better in the undiscounted case as well (as the summation extends to infinity). 
Furthermore, we have introduced a new parameter $\lambda$ which affects the trade-off between the bias 
and variance of the estimate (Watkins, 1989). An increase in $\lambda$ puts more weight on less biased 
estimates with higher variances and thus the bias in $V_t^\lambda$ decreases at the expense of a higher 
variance. 

The mathematical convenience of using the geometric average can be seen as follows. Given 
the estimates $V_t^\lambda(i)$ the obvious way to use them in a learning rule is 

$$V_{t+1}(i_t) = V_t(i_t) + \alpha[V_t^\lambda(i_t) - V_t(i_t)] \tag{16}$$

In terms of prediction differences, that is 

$$\Delta_t(i_t) = c_{i_t} + \gamma V_t(i_{t+1}) - V_t(i_t) \tag{17}$$

the geometric weighting allows us to write the correction term in the learning rule as 

$$V_t^\lambda(i_t) - V_t(i_t) = \Delta_t(i_t) + (\gamma\lambda)\Delta_t(i_{t+1}) + (\gamma\lambda)^2 \Delta_t(i_{t+2}) + \ldots \tag{18}$$
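
The identity between the geometric average of truncated estimates and the discounted sum of prediction differences can be checked numerically on a single terminating sequence. The sketch below rests on assumptions made for the example: the helper names, the list representation of a sequence, and a cost free terminal state with value zero, so that all truncated estimates beyond the end of the sequence coincide and their geometric weights can be collected into one term.

```python
def n_step_estimate(costs, states, V, t, n, gamma):
    """Truncated n-step estimate; the value after termination is zero."""
    T = len(states)
    total, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        total += discount * costs[k]
        discount *= gamma
    if t + n < T:
        total += discount * V[states[t + n]]
    return total


def lambda_estimate(costs, states, V, t, gamma, lam):
    """Geometric average of the truncated estimates for a finite sequence."""
    remaining = len(states) - t
    total = 0.0
    for n in range(1, remaining):
        total += (1 - lam) * lam ** (n - 1) * n_step_estimate(
            costs, states, V, t, n, gamma)
    # Every estimate with n >= remaining equals the full observed return,
    # so the leftover geometric weight lam**(remaining - 1) goes to it.
    total += lam ** (remaining - 1) * n_step_estimate(
        costs, states, V, t, remaining, gamma)
    return total


def discounted_difference_sum(costs, states, V, t, gamma, lam):
    """Sum of prediction differences weighted by powers of gamma * lam."""
    T = len(states)
    total = 0.0
    for k in range(t, T):
        v_next = V[states[k + 1]] if k + 1 < T else 0.0
        delta = costs[k] + gamma * v_next - V[states[k]]
        total += (gamma * lam) ** (k - t) * delta
    return total


costs = [1.0, 2.0, 3.0]
states = [0, 1, 2]
V = [10.0, 20.0, 30.0]
lhs = lambda_estimate(costs, states, V, 0, 0.5, 0.5) - V[states[0]]
rhs = discounted_difference_sum(costs, states, V, 0, 0.5, 0.5)
```

For this sequence both sides evaluate to -1.4375, and the same agreement holds for other choices of the two parameters in [0, 1].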

Note that up to now the prediction differences that need to be calculated in the future depend on 
the current $V_t(i)$. If the chain is nonabsorbing this computational implausibility can, however, 
be overcome by updating the predictions at each step with the prediction differences calculated 
by using the current predictions. This procedure gives the on-line version of TD($\lambda$): 

$$V_{t+1}(i) = V_t(i) + \alpha_t \Delta_t(i_t) \sum_{k=0}^{t} (\gamma\lambda)^{t-k} \chi_i(k) \tag{19}$$

where $\chi_i(k)$ is the indicator variable of whether state $i$ was visited at the $k$th step (of a sequence). 
Note that the sum contains the effect of the modifications or activity traces initiated at past time 
steps. Moreover, it is important to note that in this case the theoretically desirable properties 
of the estimates derived earlier may hold only asymptotically (see the convergence proof in the 
next section). 
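
The on-line rule can be sketched with one activity trace per state, which holds the running value of the weighted sum of visit indicators. The example is an illustration only: the five-state symmetric random walk, the cost of one paid on leaving through the right end, the undiscounted setting, and the per-sequence step size (sequence index raised to the power -0.7) are all assumptions chosen so that the correct predictions are known in closed form.

```python
import random


def online_td_lambda(episodes, gamma=1.0, lam=0.5, n_states=5, seed=0):
    """On-line TD(lambda) with activity traces on an absorbing random walk."""
    rng = random.Random(seed)
    V = [0.0] * n_states
    for n in range(1, episodes + 1):
        alpha = n ** -0.7              # assumed step-size schedule
        trace = [0.0] * n_states       # activity traces, reset per sequence
        i = n_states // 2              # every sequence starts in the middle
        while True:
            j = i + rng.choice((-1, 1))
            terminal = j < 0 or j >= n_states
            cost = 1.0 if j >= n_states else 0.0   # pay 1 on the right exit
            v_next = 0.0 if terminal else V[j]
            delta = cost + gamma * v_next - V[i]   # prediction difference
            # Decay all traces by gamma * lam, then mark the visited state.
            for s in range(n_states):
                trace[s] *= gamma * lam
            trace[i] += 1.0
            # Every state is corrected in proportion to its current trace.
            for s in range(n_states):
                V[s] += alpha * delta * trace[s]
            if terminal:
                break
            i = j
    return V


V_online = online_td_lambda(20000)
# For this walk the expected cost from state i (counted from 0) is the
# probability of leaving through the right end, (i + 1) / 6.
V_true = [(i + 1) / 6 for i in range(5)]
```

Because the predictions change inside a sequence, the traces combine prediction differences computed with slightly different tables, which is why the properties derived for the geometric average hold only asymptotically here.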

In the absorbing case the estimates $V_t(i)$ can also be updated off-line, that is, after a 
complete sequence has been observed. The learning rule for this case is derived simply from 
collecting the correction traces initiated at each step of the sequence. More concisely, the total 
correction is the sum of individual correction traces illustrated in eq. (18). This results in the 
batch learning rule 

$$V_{n+1}(i) = V_n(i) + \alpha_n \sum_{t=1}^{m} \Delta_n(i_t) \sum_{k=0}^{t} (\gamma\lambda)^{t-k} \chi_i(k) \tag{20}$$

where the $(m + 1)$th step is the termination state. 

We note that the above derivation of the TD($\lambda$) algorithm corresponds to the specific choice 
of a linear representation for the predictors $V_t(i)$ (see, e.g., Dayan, 1992). Learning rules for 
other representations can be obtained using gradient descent but these are not considered here. 
In practice TD($\lambda$) is usually applied to an absorbing chain thus allowing the use of either the 
batch or the on-line version but the latter is usually preferred. 

Convergence of TD($\lambda$) 

As we are interested in strong forms of convergence we need to modify the algorithm slightly. 
The learning rate parameters $\alpha_n$ are replaced by $\alpha_n(i)$ which satisfy $\sum_n \alpha_n(i) = \infty$ and 
$\sum_n \alpha_n^2(i) < \infty$ uniformly w.p.1. These parameters allow asynchronous updating and they 
can, in general, be random variables. The convergence of the algorithm is guaranteed by the 
following theorem which is an application of Theorem 1. 

Theorem 3 For any finite absorbing Markov chain, for any distribution of starting states with 
no inaccessible states, and for any distributions of the costs with finite variances the TD($\lambda$) 
algorithm given by 

1) 

$$V_{n+1}(i) = V_n(i) + \alpha_n(i) \sum_{t=1}^{m} [c_{i_t} + \gamma V_n(i_{t+1}) - V_n(i_t)] \sum_{k=1}^{t} (\gamma\lambda)^{t-k} \chi_i(k)$$

2) 

$$V_{t+1}(i) = V_t(i) + \alpha_t(i) [c_{i_t} + \gamma V_t(i_{t+1}) - V_t(i_t)] \sum_{k=1}^{t} (\gamma\lambda)^{t-k} \chi_i(k)$$

converges to the optimal predictions w.p.1 provided $\sum_n \alpha_n(i) = \infty$ and $\sum_n \alpha_n^2(i) < \infty$ uniformly 
w.p.1 and $\gamma, \lambda \in [0, 1]$ with $\gamma\lambda < 1$. 

Proof for (1): Using the ideas described in the previous section the learning rule can be 
written as 

$$V_{n+1}(i) = V_n(i) + \alpha_n(i) \Big[ G_n(i) - \frac{m(i)}{E\{m(i)\}} V_n(i) \Big], \qquad G_n(i) = \frac{1}{E\{m(i)\}} \sum_{k=1}^{m(i)} V_n^\lambda(i;k),$$

where $V_n^\lambda(i;k)$ is an estimate calculated at the $k$th occurrence of state $i$ in a sequence and for 
mathematical convenience we have made the transformation $\alpha_n(i) \to E\{m(i)\}\alpha_n(i)$, where $m(i)$ 
is the number of times state $i$ was visited during the sequence. 

To apply Theorem 1 we subtract $V^*(i)$, the optimal predictions, from both sides of the 
learning equation. By identifying $\alpha_n(i) := \alpha_n(i) m(i)/E\{m(i)\}$, $\beta_n(i) := \alpha_n(i)$, and $F_n(i) := 
G_n(i) - V^*(i) m(i)/E\{m(i)\}$ we need to show that these satisfy the conditions of Theorem 1. 
For $\alpha_n(i)$ and $\beta_n(i)$ this is obvious. We begin here by showing that $F_n(i)$ indeed is a contraction 
mapping. To this end, 

$$\max_i |E\{F_n(i) \mid V_n\}| = \max_i \Big| \frac{1}{E\{m(i)\}} E\{ (V_n^\lambda(i;1) - V^*(i)) + (V_n^\lambda(i;2) - V^*(i)) + \cdots \mid V_n \} \Big|$$

which can be bounded above by using the relation 

$$\begin{aligned}
|E\{V_n^\lambda(i;k) - V^*(i) \mid V_n\}| 
&\le E\{ |E\{V_n^\lambda(i;k) - V^*(i) \mid m(i) \ge k, V_n\}| \, \theta(m(i) - k) \mid V_n \} \\
&\le P\{m(i) \ge k\} \, |E\{V_n^\lambda(i) - V^*(i) \mid V_n\}| \\
&\le \gamma P\{m(i) \ge k\} \max_i |V_n(i) - V^*(i)|
\end{aligned}$$

where $\theta(x) = 0$ if $x < 0$ and 1 otherwise. Here we have also used the fact that $V_n^\lambda(i)$ is a 
contraction mapping independent of possible discounting. As $\sum_k P\{m(i) \ge k\} = E\{m(i)\}$ we 
finally get 

$$\max_i |E\{F_n(i) \mid V_n\}| \le \gamma \max_i |V_n(i) - V^*(i)|$$

The variance of $F_n(i)$ can be seen to be bounded by 

$$E\{m^4\} \max_i |V_n(i)|^2$$

For any absorbing Markov chain the convergence to the terminal state is geometric and thus 
for every finite $k$, $E\{m^k\} \le C(k)$, implying that the variance of $F_n(i)$ is within the bounds of 
Theorem 1. As Theorem 1 is now applicable we can conclude that the batch version of TD($\lambda$) 
converges to the optimal predictions w.p.1. □ 

Proof for (2): The proof for the on-line version is achieved by showing that the effect 
of the on-line updating vanishes in the limit thereby forcing the two versions to be equal 
asymptotically. We view the on-line version as a batch algorithm in which the updates are 
made after each complete sequence but are made in such a manner so as to be equal to those 
made on-line. 

Define $G'_n(i) = G_n(i) + R_n(i)$ to be the new batch estimate where $R_n(i)$ is the difference 
between the on-line and batch estimates. We define the new batch learning parameters to 
be the maxima over a sequence, that is $\alpha_n(i) = \max_{t \in S} \alpha_t(i)$. Now $R_n(i)$ consists of terms 
proportional to 

$$[c_{i_t} + \gamma V_n(i_{t+1}) - V_n(i_t)]$$

the expected value of which can be bounded by $A = 2 \| V_n - V^* \|$. Assuming that $\gamma\lambda < 1$ 
(which implies that the multipliers of the above terms are bounded) we can get an upper bound 
for the expected value of the correction $R_n(i)$. Let us define $R_{n,t}$ to be the expected difference 
between the on-line estimate after $t$ steps and the first $t$ terms of the batch estimate. We can 
bound $R_{n,t}(i)$ readily by the update rule resulting in the iteration 

$$\| R_{n,t+1} \| \le \| \alpha_n \| \, C (A + \| R_{n,t} \|)$$

where $R_{n,m}(i) = E\{R_n(i) \mid V_n\}$, $R_{n,0}(i) = 0$, and $C$ is some constant. Since $\| \alpha_n \|$ goes to zero 
w.p.1 the above iteration implies that $\| R_{n,m} \| \to 0$ w.p.1 giving 

$$\max_i |E\{R_n(i) \mid V_n\}| \le C_n \max_i |V_n(i) - V^*(i)|$$

where $C_n \to 0$ w.p.1. Therefore using the results for the batch algorithm, $F'_n(i) = G'_n(i) - 
V^*(i) m(i)/E\{m(i)\}$ satisfies 

$$\max_i |E\{F'_n(i)\}| \le (\gamma + C_n) \max_i |V_n(i) - V^*(i)|$$

where for large $n$, $(\gamma + C_n) \le \gamma' < 1$ w.p.1. The variance of $R_n(i)$ and thereby that of $F'_n(i)$ are 
within the bounds of Theorem 1 by linearity. This completes the proof. □ 

Conclusions 

In this paper we have extended results from stochastic approximation theory to cover asyn- 
chronous relaxation processes which have a contraction property with respect to some maximum 
norm (Theorem 1). This new class of converging iterative processes is shown to include both 
the Q-learning and TD($\lambda$) algorithms in either their on-line or batch versions. We note that 
the convergence of the on-line version of TD($\lambda$) has not been shown previously. We also wish 
to emphasize the simplicity of our results. The convergence proofs for Q-learning and TD($\lambda$) 
utilize only high-level statistical properties of the estimates used in these algorithms and do not 
rely on constructions specific to the algorithms. Our approach also sheds additional light on 
the similarities between Q-learning and TD($\lambda$). 

Although Theorem 1 is readily applicable to DP-based learning schemes, the theory of 
Dynamic Programming is important only for its characterization of the optimal solution and 
for a contraction property needed in applying the theorem. The theorem can be applied to 
iterative algorithms of different types as well. 

Finally we note that Theorem 1 can be extended to cover processes that do not show the 
usual contraction property thereby increasing its applicability to algorithms of possibly more 
practical importance. 

Proof of Theorem 1 

In this section we provide a detailed proof of the theorem on which the convergence proofs for 
Q-learning and TD($\lambda$) were based. We introduce and prove three essential lemmas, which will 
also help to clarify ties to the literature and the ideas behind the theorem, followed by the proof 
of Theorem 1. The notation $\| \cdot \|_W = \max_x | \cdot / W(x) |$ will be used in what follows. 

Lemma 1 A random process 

$$w_{n+1}(x) = (1 - \alpha_n(x)) w_n(x) + \beta_n(x) r_n(x)$$

converges to zero with probability one if the following conditions are satisfied: 

1) $\sum_n \alpha_n(x) = \infty$, $\sum_n \alpha_n^2(x) < \infty$, $\sum_n \beta_n(x) = \infty$, and $\sum_n \beta_n^2(x) < \infty$ uniformly w.p.1. 

2) $E\{r_n(x) \mid P_n\} = 0$ and $E\{r_n^2(x) \mid P_n\} \le C$ w.p.1, where 

$$P_n = \{w_n, w_{n-1}, \ldots, r_{n-1}, r_{n-2}, \ldots, \alpha_{n-1}, \alpha_{n-2}, \ldots, \beta_{n-1}, \beta_{n-2}, \ldots\}$$

All the random variables are allowed to depend on the past $P_n$. 

Proof. Except for the appearance of $\beta_n(x)$ this is a standard result. With the above 
definitions convergence follows directly from Dvoretzky's extended theorem (Dvoretzky, 1956). 

Lemma 2 Consider a random process $X_{n+1}(x) = G_n(X_n, x)$, where 

$$G_n(\beta X_n, x) = \beta G_n(X_n, x)$$

Let us suppose that if we kept $\| X_n \|$ bounded by scaling, then $X_n$ would converge to zero w.p.1. 
This assumption is sufficient to guarantee that the original process converges to zero w.p.1. 

Proof. Note that the scaling of $X_n$ at any point of the iteration corresponds to having 
started the process with scaled $X_0$. Fix some constant $C$. If during the iteration, $\| X_n \|$ 
increases above $C$, then $X_n$ is scaled so that $\| X_n \| = C$. By the assumption then this process 
must converge w.p.1. To show that the net effect of the corrections must stay finite w.p.1 we 
note that if $\| X_n \|$ converges then for any $\epsilon > 0$ there exists $M_\epsilon$ such that $\| X_n \| < \epsilon < C$ for 
all $n > M_\epsilon$ with probability at least $1 - \epsilon$. But this implies that the iteration stays below $C$ 
after $M_\epsilon$ and converges to zero without any further corrections. □ 

Lemma 3 A stochastic process $X_{n+1}(x) = (1 - \alpha_n(x)) X_n(x) + \gamma \beta_n(x) \| X_n \|$ converges to zero 
w.p.1 provided 

1) $x \in S$, where $S$ is a finite set. 

2) $\sum_n \alpha_n(x) = \infty$, $\sum_n \alpha_n^2(x) < \infty$, $\sum_n \beta_n(x) = \infty$, $\sum_n \beta_n^2(x) < \infty$, and 
$E\{\beta_n(x)\} \le E\{\alpha_n(x)\}$ uniformly w.p.1. 

Proof. Essentially the proof is an application of Lemma 2. To this end, assume that we 
keep $\| X_n \| \le C_1$ by scaling which allows the iterative process to be bounded by 

$$|X_{n+1}(x)| \le (1 - \alpha_n(x)) |X_n(x)| + \gamma \beta_n(x) C_1$$

This is linear in $|X_n(x)|$ and can be easily shown to converge w.p.1 to some $X^*(x)$, where 
$\| X^* \| \le \gamma C_1$. Hence, for small enough $\epsilon$, there exists $M_1(\epsilon)$ such that $\| X_n \| \le C_1/(1 + \epsilon)$ 
for all $n > M_1(\epsilon)$ with probability at least $p_1(\epsilon)$. With probability $p_1(\epsilon)$ the procedure can be 
repeated for $C_2 = C_1/(1 + \epsilon)$. Continuing in this manner and choosing $p_k(\epsilon)$ so that $\prod_k p_k(\epsilon)$ 
goes to one as $\epsilon \to 0$ we obtain the w.p.1 convergence of the bounded iteration and Lemma 2 
can be applied. □ 
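
The process in Lemma 3 is easy to iterate directly when the coefficients are deterministic. The sketch below is an illustration under assumed choices that are not in the paper: three components, identical coefficient sequences (n raised to the power -0.6), a contraction factor of 0.5, and the unweighted maximum norm.

```python
def lemma3_norms(x0, steps, gamma=0.5):
    """Iterate X_{n+1}(x) = (1 - a_n) X_n(x) + gamma * b_n * ||X_n||.

    Returns the maximum norm of the iterate after each step, starting
    with the norm of the initial vector.
    """
    x = list(x0)
    norms = [max(abs(v) for v in x)]
    for n in range(1, steps + 1):
        alpha = n ** -0.6      # assumed: sum diverges, squares are summable
        beta = alpha           # assumed: beta_n equal to alpha_n
        norm = max(abs(v) for v in x)
        x = [(1 - alpha) * v + gamma * beta * norm for v in x]
        norms.append(max(abs(v) for v in x))
    return norms


norms = lemma3_norms([3.0, -1.0, 0.5], 5000)
```

Each component is pulled toward a fraction of the current norm, so the norm can never increase and it shrinks by a factor of at most one minus half the step size per iteration in this setting. Since the step sizes have a divergent sum, the norm is driven to zero.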

Theorem 1 A random iterative process $\Delta_{n+1}(x) = (1 - \alpha_n(x))\Delta_n(x) + \beta_n(x) F_n(x)$ converges 
to zero w.p.1 under the following assumptions: 

1) The state space is finite. 

2) $\sum_n \alpha_n(x) = \infty$, $\sum_n \alpha_n^2(x) < \infty$, $\sum_n \beta_n(x) = \infty$, $\sum_n \beta_n^2(x) < \infty$, and 
$E\{\beta_n(x) \mid P_n\} \le E\{\alpha_n(x) \mid P_n\}$ uniformly w.p.1. 

3) $\| E\{F_n(x) \mid P_n\} \|_W \le \gamma \| \Delta_n \|_W$, where $\gamma \in (0, 1)$. 

4) $\mathrm{Var}\{F_n(x) \mid P_n\} \le C(1 + \| \Delta_n \|_W)^2$, where $C$ is some constant. 

Here $P_n = \{\Delta_n, \Delta_{n-1}, \ldots, F_{n-1}, \ldots, \alpha_{n-1}, \ldots, \beta_{n-1}, \ldots\}$ stands for the past at step $n$. $F_n(x)$, 
$\alpha_n(x)$ and $\beta_n(x)$ are allowed to depend on the past insofar as the above conditions remain valid. 
The notation $\| \cdot \|_W$ refers to some weighted maximum norm. 

Proof. By defining $r_n(x) = F_n(x) - E\{F_n(x) \mid P_n\}$ we can decompose the iterative process 
into two parallel processes given by 

$$\begin{aligned}
\delta_{n+1}(x) &= (1 - \alpha_n(x)) \delta_n(x) + \beta_n(x) E\{F_n(x) \mid P_n\} \\
w_{n+1}(x) &= (1 - \alpha_n(x)) w_n(x) + \beta_n(x) r_n(x)
\end{aligned} \tag{21}$$

where $\Delta_n(x) = \delta_n(x) + w_n(x)$. Dividing the equations by $W(x)$ for each $x$ and denoting 
$\delta'_n(x) = \delta_n(x)/W(x)$, $w'_n(x) = w_n(x)/W(x)$, and $r'_n(x) = r_n(x)/W(x)$ we can bound the $\delta'_n$ 
process by assumption 3) and rewrite the equation pair as 

$$\begin{aligned}
|\delta'_{n+1}(x)| &\le (1 - \alpha_n(x)) |\delta'_n(x)| + \gamma \beta_n(x) \, \| \, |\delta'_n| + w'_n \| \\
w'_{n+1}(x) &= (1 - \alpha_n(x)) w'_n(x) + \beta_n(x) r'_n(x)
\end{aligned}$$

Assume for a moment that the $\Delta_n$ process stays bounded. Then the variance of $r'_n(x)$ is 
bounded by some constant $C$ and thereby $w'_n$ converges to zero w.p.1 according to Lemma 1. 
Hence, there exists $M$ such that for all $n > M$, $\| w'_n \| < \epsilon$ with probability at least $1 - \epsilon$. This 
implies that the $\delta'_n$ process can be further bounded by 

$$|\delta'_{n+1}(x)| \le (1 - \alpha_n(x)) |\delta'_n(x)| + \gamma \beta_n(x) \| \delta'_n + \epsilon \|$$

with probability $> 1 - \epsilon$. If we choose $C$ such that $\gamma(C + 1)/C < 1$ then for $\| \delta'_n \| > C\epsilon$ 

$$\gamma \| \delta'_n + \epsilon \| \le \gamma \frac{C + 1}{C} \| \delta'_n \|$$

and the process defined by this upper bound converges to zero w.p.1 by Lemma 3. Thus $\| \delta'_n \|$ 
converges w.p.1 to some value bounded by $C\epsilon$ which guarantees the w.p.1 convergence of the 
original process under the boundedness assumption. 

By assumption (4) $r_n(x)$ can be written as $(1 + \| \delta_n + w_n \|) s_n(x)$, where $E\{s_n^2(x) \mid P_n\} \le C$. 
Let us now decompose $w_n$ as $u_n + v_n$ with 

$$u_{n+1}(x) = (1 - \alpha_n(x)) u_n(x) + \beta_n(x) \| \delta_n + u_n + v_n \| \, s_n(x)$$

and $v_n$ converges to zero w.p.1 by Lemma 1. Again by choosing $C$ such that $\gamma(C + 1)/C < 1$ 
we can bound the $\delta_n$ and $u_n$ processes for $\| \delta_n + u_n \| > C\epsilon$. The pair $(\delta_n, u_n)$ is then a 
scale invariant process whose bounded version was proven earlier to converge to zero w.p.1 and 
therefore by Lemma 2 it too converges to zero w.p.1. This proves the w.p.1 convergence of the 
triple $\delta_n$, $u_n$, and $v_n$ bounding the original process. □ 